Abstract: This paper proves the separation between source-network
coding and channel coding in networks of noisy, discrete,
memoryless channels. We show that the set of achievable distortion
matrices in delivering a family of dependent sources across
such a network equals the set of achievable distortion matrices
for delivering the same sources across a distinct network which
is built by replacing each channel by a noiseless, point-to-point
bit-pipe of the corresponding capacity. Thus a code that applies
source-network coding across links that are made almost lossless
through the application of independent channel coding across
each link asymptotically achieves the optimal performance across
the network as a whole.

I. Introduction 

In his seminal work [1], Shannon separates the problem of
communicating a memoryless source across a single noisy,
memoryless channel into separate lossless source coding and
channel coding problems. The corresponding result for lossy
coding in point-to-point channels is almost immediate, since
lossy coding in a point-to-point channel is equivalent to lossless
coding of the codeword indices, and it appears in the same
work [1]. For a single point-to-point channel, separation holds
under a wide variety of source and channel distributions (see,
for example, [2] and the references therein). Unfortunately,
separation does not necessarily hold in network systems. Even
in very small networks like the multiple access channel [3],
separation can fail when statistical dependencies between the
sources at different network locations are useful for increasing
the rate across the channel. Since source codes tend to destroy
such dependencies, joint source-channel codes can achieve
better performance than separate source and channel codes
in these scenarios.

This paper proves the separation between source-network
coding and channel coding in networks of independent noisy,
discrete, memoryless channels (DMCs). Roughly, we show that
the vector of achievable distortions in delivering a family of
dependent sources across such a network $\mathcal{N}$ equals the vector
of achievable distortions for delivering the same sources across
a distinct network $\hat{\mathcal{N}}$. Network $\hat{\mathcal{N}}$ is built by replacing each
channel $p(y|x)$ in $\mathcal{N}$ by a noiseless, point-to-point bit-pipe
of the corresponding capacity $C = \max_{p(x)} I(X;Y)$. Thus a
code that applies source-network coding across links that are
made almost lossless through the application of independent
channel coding across each link asymptotically achieves the
optimal performance across the network as a whole. Note
that the operations of network source coding and network
coding are not separable, as shown in [4] and [5] for
non-multicast and multicast lossless source coding, respectively.
As a result, a joint network-source code is required, and only
the channel code can be separated. While the achievability
of a separated strategy is straightforward, the converse is
more difficult, since preserving statistical dependence between
codewords transmitted across distinct edges of a network of
noisy links improves the end-to-end network performance in
some networks [6].

The results derived here give a partial generalization of [7],
[8] and [6], which prove the separation between network
coding and channel coding for multicast [7], [8] and general
demands [6], respectively, under the assumption that messages
transmitted to different subsets of users are independent and
uniformly distributed. The shift here is from independent
sources to dependent sources, from lossless to lossy data
description, and from memoryless to non-memoryless sources.

The remainder of the paper is organized as follows. Sections
II and III describe the notation and problem set-up,
respectively. Section IV describes a tool called a stacked
network that allows us to employ typicality across copies of a
network rather than typicality across time in the arguments that
follow. Section V gives our main results for both memoryless
sources and sources with memory.

II. Notation 

Calligraphic letters, like $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{U}$, refer to sets, and the
size of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. For a random variable $X$,
its alphabet set is represented by $\mathcal{X}$.

While a random variable is denoted by $X$, $\underline{X}$ represents a
random vector. The length of a vector is implied by the context,
and its $\ell$th element is denoted by $X(\ell)$.

For two vectors $\underline{x}_1$ and $\underline{x}_2$ of the same length $r$,
$\|\underline{x}_1 - \underline{x}_2\|_1$ denotes the $\ell_1$ distance between the two vectors, defined
as $\|\underline{x}_1 - \underline{x}_2\|_1 = \sum_{i=1}^{r} |x_1(i) - x_2(i)|$. If the two vectors are
probability distributions, i.e., $\sum_i x_1(i) = \sum_i x_2(i) = 1$ and
all entries are nonnegative, the total variation
distance between $\underline{x}_1$ and $\underline{x}_2$ is defined as
$\|\underline{x}_1 - \underline{x}_2\|_{TV} = 0.5\,\|\underline{x}_1 - \underline{x}_2\|_1$.
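
As a small illustration of the two distances defined above, the following Python sketch computes the $\ell_1$ and total variation distances for vectors stored as plain lists. The function names `l1_distance` and `tv_distance` are chosen here for illustration only and do not come from the paper.

```python
def l1_distance(x1, x2):
    """l1 distance between two equal-length vectors (illustrative helper)."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x1, x2))


def tv_distance(p1, p2):
    """Total variation distance, defined as half of the l1 distance.

    The inputs are assumed to be probability vectors (nonnegative
    entries summing to one); this is not checked here.
    """
    return 0.5 * l1_distance(p1, p2)
```

For example, a fair coin and a deterministic coin are at total variation distance 0.5 under this definition.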

Unlike 0, this paper uses strong typicality arguments 
to demonstrate the equivalence between noisy channels and 
noiseless bit-pipes of the same capacity. We therefore assume 
that the channel input and output alphabets are finite. The 
alphabets for the sources described across the channel may be 
discrete or continuous. 

III. The problem setup 

Consider a multiterminal network $\mathcal{N}$ consisting of $m$ nodes
interconnected via point-to-point, independent DMCs.
The network structure is represented by a directed graph
$G$ with node set $\mathcal{V} = \{1, \ldots, m\}$ and edge set $\mathcal{E}$. Each
directed edge $e = [v_1, v_2] \in \mathcal{E}$ implies a point-to-point
DMC between nodes $v_1$ (input) and $v_2$ (output). Each node
$a$ observes some source process $U^{(a)} = \{U^{(a)}_k\}_{k=1}^{\infty}$ and is
interested in reconstructing a subset of the processes observed
by the other nodes. The alphabet $\mathcal{U}^{(a)}$ of source $U^{(a)}$
can be either scalar or vector-valued. This allows node $a$
to have a vector of sources. To achieve this goal in a
block coding framework, source output symbols are divided
into non-overlapping blocks of length $L$. Each block is
described separately. At the beginning of the $j$th coding period,
each node $a$ has observed a length-$L$ block of the process
$U^{(a)}$, i.e., $U^{(a),jL}_{(j-1)L+1} = (U^{(a)}_{(j-1)L+1}, \ldots, U^{(a)}_{jL})$. The blocks
$\{U^{(a),jL}_{(j-1)L+1}\}_{a \in \mathcal{V}}$ observed at different nodes are described
over the network in $n$ uses of the network (the rate $\kappa = L/n$
is a parameter of the code). For those $n$ time steps, at each
step $t \in \{1, \ldots, n\}$, each node $a$ generates its next channel
inputs as a function of $U^{(a),jL}_{(j-1)L+1}$ and its channels' outputs up
to time $t-1$, here denoted by $Y^{(a),t-1} = (Y^{(a)}_1, \ldots, Y^{(a)}_{t-1})$,
according to

$$X^{(a)}_t : \mathcal{Y}^{(a),t-1} \times \mathcal{U}^{(a),L} \to \mathcal{X}^{(a)}. \qquad (1)$$

Note that each node might be the input to more than one channel
and/or the output of more than one channel. Hence, both
$X^{(a)}_t$ and $Y^{(a)}_t$ might be vectors depending on the indegree and
outdegree of node $a$. The reconstruction at node $b$ of the block
observed at node $a$ is denoted by $\hat{U}^{(a \to b),L}$. This reconstruction
is a function of the source observed at node $b$ and node
$b$'s channel outputs, i.e., $\hat{U}^{(a \to b),L} = g^{(a \to b)}(Y^{(b),n}, U^{(b),L})$,
where

$$g^{(a \to b)} : \mathcal{Y}^{(b),n} \times \mathcal{U}^{(b),L} \to \hat{\mathcal{U}}^{(a \to b),L}. \qquad (2)$$

The performance criterion for a coding scheme is its induced
expected average distortions between sources and reconstruction
blocks, i.e., for all $a, b \in \mathcal{V}$,

$$E\, d^{(a \to b)}_L(U^{(a),L}, \hat{U}^{(a \to b),L}) = E\Big[\frac{1}{L} \sum_{k=1}^{L} d^{(a \to b)}(U^{(a)}_k, \hat{U}^{(a \to b)}_k)\Big],$$

where $d^{(a \to b)} : \mathcal{U}^{(a)} \times \hat{\mathcal{U}}^{(a \to b)} \to \mathbb{R}^{+}$ is a per-letter distortion
measure. As mentioned before, $U^{(a)}$ and $\hat{U}^{(a \to b)}$ are either
scalar or vector-valued. This allows the case where node $a$
observes multiple sources and node $b$ is interested in
reconstructing a subset of them. Let



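$$d_{\max} = \max_{a, b \in \mathcal{V},\; \alpha \in \mathcal{U}^{(a)},\; \beta \in \hat{\mathcal{U}}^{(a \to b)}} d^{(a \to b)}(\alpha, \beta).$$

If node $b$ is not interested in reconstructing node $a$, then
$d^{(a \to b)} \equiv 0$.

The distortion matrix $D$ is said to be achievable at a rate $\kappa$
in a network $\mathcal{N}$ if, for any $\epsilon > 0$, there exist a pair $(L, n)$
with $L/n = \kappa$ and a block length-$n$ coding scheme such that

$$E\, d^{(a \to b)}_L(U^{(a),L}, \hat{U}^{(a \to b),L}) \le D(a, b) + \epsilon, \qquad (3)$$

for any $a, b \in \mathcal{V}$.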
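
To make the block-averaged distortion concrete, here is a minimal Python sketch that averages a per-letter distortion measure over one source block and its reconstruction. The Hamming measure and the names `hamming` and `block_distortion` are assumptions made for illustration; the paper allows any per-letter measure, and the quantity in (3) is the expectation of this average over the sources and the channels, which the sketch does not compute.

```python
def hamming(u, u_hat):
    """Hamming per-letter distortion: 0 if the symbols agree, 1 otherwise.

    Chosen here only as an example of a per-letter distortion measure.
    """
    return 0.0 if u == u_hat else 1.0


def block_distortion(source_block, reconstruction_block, per_letter=hamming):
    """Average per-letter distortion over one length-L block (one realization)."""
    if len(source_block) != len(reconstruction_block):
        raise ValueError("blocks must have the same length")
    if not source_block:
        raise ValueError("blocks must be non-empty")
    total = sum(per_letter(u, v)
                for u, v in zip(source_block, reconstruction_block))
    return total / len(source_block)
```

For a block of length 4 with a single symbol error, `block_distortion` returns 0.25.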



IV. Stacked network 

For a given network $\mathcal{N}$, the corresponding $N$-fold stacked
network $\underline{\mathcal{N}}$ is defined as $N$ copies of the original network
[6]. That is, for each node and each edge in $\mathcal{N}$, there are $N$
copies of the same node or edge in $\underline{\mathcal{N}}$. At each time
instance, each node has access to the data available at the nodes
which are its copies, and potentially uses this extra information
in generating the channel inputs of future time instances.
Likewise, in decoding, all $N$ copies of a node can collaborate
in reconstructing the signals. This is made more precise in the
following two definitions:

$$\underline{X}^{(a)}_t : \underline{\mathcal{Y}}^{(a),t-1} \times \mathcal{U}^{(a),NL} \to \underline{\mathcal{X}}^{(a)}, \qquad (4)$$

and

$$g^{(a \to b)} : \underline{\mathcal{Y}}^{(b),n} \times \mathcal{U}^{(b),NL} \to \hat{\mathcal{U}}^{(a \to b),NL}, \qquad (5)$$

which correspond to (1) and (2) in the original network. In
(4) and (5), all the underlined vectors are of length $N$.

In an $N$-layered network, the distortion between the source
observed at node $a$ and its reconstruction at node $b$ is defined
as

$$D_N(a, b) = E\big[d^{(a \to b)}_{NL}(U^{(a),NL}, \hat{U}^{(a \to b),NL})\big], \qquad (6)$$

for any $a, b \in \{1, \ldots, m\}$.

A distortion matrix $D$ is said to be achievable in the stacked
network at some rate $\kappa$ if, for any given $\epsilon > 0$, there exist $N$, $n$
and $L$ large enough such that $D_N(a, b) \le D(a, b) + \epsilon$ for all
$a, b \in \{1, \ldots, m\}$. Note that the dimension of the distortion
matrices in both single-layer and multi-layer networks is
$m \times m$. Let $\mathcal{D}(\kappa, \mathcal{N})$ and $\mathcal{D}_s(\kappa, \underline{\mathcal{N}})$ denote the closure of the set of
achievable distortion matrices at some rate $\kappa$ in a network $\mathcal{N}$
and its stacked version $\underline{\mathcal{N}}$, respectively. The following theorem
establishes the relationship between the two sets.

Theorem 1: At any rate $\kappa$,

$$\mathcal{D}(\kappa, \mathcal{N}) = \mathcal{D}_s(\kappa, \underline{\mathcal{N}}). \qquad (7)$$



Proof:
i. Proof of $\mathcal{D}(\kappa, \mathcal{N}) \subseteq \mathcal{D}_s(\kappa, \underline{\mathcal{N}})$. Consider any
$D \in \mathrm{int}(\mathcal{D}(\kappa, \mathcal{N}))$. Then for any $\epsilon > 0$, there exists a coding
scheme operating at rate $\kappa = L/n$ on $\mathcal{N}$ such that
(3) is satisfied. For any $N$, a stacked network that uses
this same coding strategy independently in each layer
achieves

$$E\, d^{(a \to b)}_{NL}(U^{(a),NL}, \hat{U}^{(a \to b),NL}) \le D(a, b) + \epsilon \qquad (8)$$

for all $a, b \in \mathcal{V}$, because the distortion over the $N$ layers is
the average of the per-layer distortions, each of which
satisfies (3). Hence, $D \in \mathcal{D}_s(\kappa, \underline{\mathcal{N}})$.
ii. Proof of $\mathcal{D}_s(\kappa, \underline{\mathcal{N}}) \subseteq \mathcal{D}(\kappa, \mathcal{N})$. Let $D \in \mathrm{int}(\mathcal{D}_s(\kappa, \underline{\mathcal{N}}))$.
Then for any $\epsilon > 0$, there exist integers
$N$, $n$, and $L$ such that a stacked network consisting of
$N$ layers, along with a block length-$n$ coding scheme for
$L$ source symbols on this stacked network, achieves

$$E\big[d^{(a \to b)}_{NL}(U^{(a),NL}, \hat{U}^{(a \to b),NL})\big] \le D(a, b) + \epsilon,$$

for all $a, b \in \mathcal{V}$. The same coding scheme can be used
in a single-layer network as follows. Consider a single-layer
network where each node observes a length-$NL$
block of source symbols and describes the block in the
next $Nn$ time steps. At times $t \in \{1, \ldots, N\}$, each node
$a$ sends what would have been sent at time 1 by node
$a$ in layer $t$ of the stacked network. After that, having
collected the output of the previous $N$ time steps, at
times $t \in \{N + 1, \ldots, 2N\}$, node $a$ sends the outputs
of the same node at time 2 in layer $t - N$. (Note that
in the first $N$ time steps, node $a$'s output is only a
function of its own source, not of the channels' outputs.
It only collects the channel outputs in order to use them
during the next $N$ time steps.) The same strategy is used
in $n$ time intervals, each comprising $N$ network uses.
During each period, the new channel outputs observed
by node $a$ are recorded to be used in future periods,
but do not affect the next inputs generated by that node
during that time period. Using this strategy, at the end
of $nN$ channel uses, each node's observation has exactly
the same distribution as the collection of observations
of its $N$ copies in the stacked network. Therefore,
applying the same decoding rule will result in the same
performance. Hence, $D \in \mathcal{D}(\kappa, \mathcal{N})$.
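
The time-sharing argument in part ii can be sketched as an index mapping between the stacked network and the single-layer network. The Python functions below are a hypothetical illustration (the names and the 1-based indexing are assumptions, not the paper's notation): step `s` of the single-layer network plays the role of layer `s - (t-1)N` at stacked time `t`, and the inputs sent during period `t` may depend only on outputs recorded in earlier periods.

```python
def stacked_to_single(layer, t, num_layers):
    """Single-layer time step that emulates (layer, stacked time t).

    All indices are 1-based: layer in 1..N, t in 1..n.
    """
    return (t - 1) * num_layers + layer


def single_to_stacked(step, num_layers):
    """Inverse map: which (layer, stacked time) a single-layer step emulates."""
    t = (step - 1) // num_layers + 1
    layer = (step - 1) % num_layers + 1
    return layer, t


def last_usable_output_step(step, num_layers):
    """Latest single-layer step whose channel output may influence the
    input sent at `step`: the final step of the previous period."""
    _, t = single_to_stacked(step, num_layers)
    return (t - 1) * num_layers
```

With `num_layers = 3`, step 5 emulates layer 2 at stacked time 2 and may use only the outputs of steps 1 to 3, which matches the requirement that outputs observed within a period do not affect inputs generated in that same period.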



V. Replacing a noisy channel with a bit pipe 

A. Memoryless sources 

In this section we assume all sources are jointly
i.i.d., i.e., for any $k \ge 1$,

$$P(U^{(1),k}, \ldots, U^{(m),k}) = \prod_{i=1}^{k} P(U^{(1)}_i, \ldots, U^{(m)}_i),$$

where $P(U^{(1)}_i, \ldots, U^{(m)}_i)$ does not
depend on $i$. Note that at each time instant the sources might
be correlated with each other.

In the described network $\mathcal{N}$, for some $a, b \in \mathcal{V}$ such that
$[a, b] \in \mathcal{E}$, consider the noisy channel connecting these two
nodes. The channel is described by its transition probabilities
$\{p(y|x)\}_{x \in \mathcal{X}, y \in \mathcal{Y}}$ and has some finite capacity
$C = \max_{p(x)} I(X; Y)$. Now consider a network $\mathcal{N}'$ which is identical
to $\mathcal{N}$ except for the noisy channel between $a$ and $b$, which is
replaced by a bit-pipe of capacity $C$.

Theorem 2: For any $\kappa > 0$,

$$\mathcal{D}(\kappa, \mathcal{N}) = \mathcal{D}(\kappa, \mathcal{N}'). \qquad (9)$$
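
As a concrete instance of the capacity $C = \max_{p(x)} I(X;Y)$ that sets the bit-pipe rate, the following Python sketch evaluates $I(X;Y)$ for a given input distribution and transition matrix. The binary symmetric channel with crossover probability 0.1 is an example chosen here, not one taken from the paper; for that channel the uniform input is known to attain the maximum, so no optimization over $p(x)$ is performed. For a general DMC, this function evaluates only one candidate input distribution.

```python
import math


def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits for input pmf p_x[x] and channel rows p_y_given_x[x][y]."""
    num_y = len(p_y_given_x[0])
    # Output marginal p(y) = sum_x p(x) p(y|x).
    p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(len(p_x)))
           for y in range(num_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(num_y):
            p_xy = px * p_y_given_x[x][y]
            if p_xy > 0:  # zero-probability pairs contribute nothing
                total += p_xy * math.log2(p_y_given_x[x][y] / p_y[y])
    return total


def binary_entropy(p):
    """Binary entropy function h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Example channel (an assumption for illustration): BSC with crossover 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
capacity_bsc = mutual_information([0.5, 0.5], bsc)
```

Here `capacity_bsc` agrees with the closed form $1 - h(0.1)$, so in this example the replacement bit-pipe would carry roughly half a bit per channel use.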

Proof outline: By Theorem 1, the achievable region of a
network is equal to the achievable region of its stacked version.
Hence, it suffices to prove that $\mathcal{D}_s(\kappa, \underline{\mathcal{N}}) = \mathcal{D}_s(\kappa, \underline{\mathcal{N}}')$.
i. $\mathcal{D}_s(\kappa, \underline{\mathcal{N}}') \subseteq \mathcal{D}_s(\kappa, \underline{\mathcal{N}})$: Let $D \in \mathrm{int}(\mathcal{D}_s(\kappa, \underline{\mathcal{N}}'))$. We
need to show that $D \in \mathcal{D}_s(\kappa, \underline{\mathcal{N}})$ as well. Note that
$\mathcal{N}$ and $\mathcal{N}'$ are identical except for the DMC connecting
nodes $a$ and $b$ in $\mathcal{N}$, which is replaced by a bit-pipe of
capacity $C$ in $\mathcal{N}'$. We next show that any code for $\underline{\mathcal{N}}'$ can
be operated on $\underline{\mathcal{N}}$ with a similar expected distortion. Let
the number of layers in both networks be $N$. Given the
capacity of the bit-pipes, the number of bits that can be
carried from $a$ to $b$ in $\underline{\mathcal{N}}'$ is at most $NR$, where $R < C$.
Hence, if $N$ is large enough, the same information can
be transmitted from $a$ to $b$ in $\underline{\mathcal{N}}$ by doing appropriate
channel coding across the layers over the noisy channel
and its copies connecting $a$ and $b$ in $\underline{\mathcal{N}}$. Let $P_{e,a \to b}$
denote the probability of error of the channel code
operating over the channel corresponding to the edge $[a, b]$
and its copies in $\underline{\mathcal{N}}$, and let $P_{e,\max} = \max_{[a,b] \in \mathcal{E}} P_{e,a \to b}$.
Then the extra expected distortion introduced at each
reconstruction point is bounded above by $|\mathcal{E}| P_{e,\max} d_{\max}$
and can be made arbitrarily small.
ii. $\mathcal{D}(\kappa, \mathcal{N}) \subseteq \mathcal{D}_s(\kappa, \underline{\mathcal{N}}')$: Let $D \in \mathrm{int}(\mathcal{D}(\kappa, \mathcal{N}))$. We
prove that $D \in \mathcal{D}_s(\kappa, \underline{\mathcal{N}}')$. Consider a code defined on
$\mathcal{N}$ that achieves within $\epsilon$ of $D$, and consider the $N$-fold
stacked version of $\mathcal{N}$, $\underline{\mathcal{N}}$. Assume that the same code
is applied independently in each layer. We first show
that, when all sources are memoryless and uniformly
distributed, the performance of the code given the realization
of $(\underline{X}_1, \underline{Y}_1)$ only depends on the empirical distribution
of $(\underline{X}_1, \underline{Y}_1)$, defined as

$$\hat{p}_{[\underline{X}_1, \underline{Y}_1]}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{(X_1(i), Y_1(i)) = (x, y)\}}, \qquad (10)$$

for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Here the subscript 1 refers
to time $t = 1$. After establishing this, we use the result
proved in [5] and show that at time $t = 1$ we can simulate
the performance of the noisy link by a bit-pipe of the
same capacity. For the rest of the proof, let $U = \{U_i\}$
and $\hat{U} = \{\hat{U}_i\}$ denote some i.i.d. source observed at some
node in $\mathcal{V}$ and its reconstruction at some other node in
$\mathcal{V}$.
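In the original network,

$$E\, d_L(U^L, \hat{U}^L) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} E\big[d_L(U^L, \hat{U}^L) \,\big|\, (X_1, Y_1) = (x, y)\big] \times P((X_1, Y_1) = (x, y)). \qquad (11)$$

On the other hand, in the $N$-fold stacked network,

$$E\big[d_{NL}(U^{NL}, \hat{U}^{NL})\big] = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} E\big[d_L(U^L, \hat{U}^L) \,\big|\, (X_1, Y_1) = (x, y)\big] \times E\big[\hat{p}_{[\underline{X}_1, \underline{Y}_1]}(x, y)\big]. \qquad (12)$$

Comparing (11) and (12) reveals that the desired result
will follow if we can find a coding scheme for which

$$\big| P((X_1, Y_1) = (x, y)) - E\big[\hat{p}_{[\underline{X}_1, \underline{Y}_1]}(x, y)\big] \big| \qquad (13)$$

can be made arbitrarily small.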

To prove this, consider a channel with input drawn i.i.d.
from some distribution $p(x)$. The encoder observes $N$
source symbols and sends a message of $NR$ bits to
the decoder. The decoder converts these $NR$ bits into
a reconstruction block $\underline{Y} = (Y(1), \ldots, Y(N))$. The empirical
joint distribution between the channel input and channel
output induced by the bit-pipe is defined in the classical
sense as follows:

$$\hat{p}_{[\underline{X}, \underline{Y}]}(x, y) = \frac{1}{N} \sum_{\ell=1}^{N} \mathbb{1}_{\{(X(\ell), Y(\ell)) = (x, y)\}}.$$

Consider a DMC described by transition probabilities
$\{p(y|x)\}_{x \in \mathcal{X}, y \in \mathcal{Y}}$ whose input is an i.i.d. process
distributed according to some distribution $p(x)$. In [5], it
is shown that, as long as $R > I(X; Y)$, any such
channel can be simulated by a bit-pipe of rate at most
$R$ such that the total variation between $\hat{p}_{[\underline{X}, \underline{Y}]}(x, y)$ and
$p(x, y) = p(x) p(y|x)$ can be made arbitrarily small for
large enough block lengths. In other words, there exists
a sequence of coding schemes over the bit-pipe such that

$$\big\| \hat{p}_{[\underline{X}, \underline{Y}]} - p \big\|_{TV} \to 0 \quad \text{a.s.} \qquad (14)$$

(where $\hat{p}_{[\underline{X}, \underline{Y}]}$ and $p$ are vectors describing the distributions
$(\hat{p}_{[\underline{X}, \underline{Y}]}(x, y) : (x, y) \in \mathcal{X} \times \mathcal{Y})$ and $(p(x, y) : (x, y) \in \mathcal{X} \times \mathcal{Y})$,
respectively).

Combining this result with our initial claim yields the
desired result, i.e., at time $t = 1$, we can replace the noisy
link by a bit-pipe. To extend this result to the next $n - 1$
time steps, we use induction. Note that in the original
network



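$$E\, d_L(U^L, \hat{U}^L) = \sum_{\substack{x_t \in \mathcal{X},\, y_t \in \mathcal{Y} \\ t = 1, \ldots, n}} E\Big[d_L(U^L, \hat{U}^L) \,\Big|\, \bigcap_{t=1}^{n} \{(X_t, Y_t) = (x_t, y_t)\}\Big] \times P((X^n, Y^n) = (x^n, y^n)). \qquad (15)$$

On the other hand, using the same analysis used in
deriving (12), in the $N$-fold stacked network,

$$E\big[d_{NL}(U^{NL}, \hat{U}^{NL})\big] = \sum_{\substack{x_t \in \mathcal{X},\, y_t \in \mathcal{Y} \\ t = 1, \ldots, n}} E\big[d_L(U^L, \hat{U}^L) \,\big|\, (X^n, Y^n) = (x^n, y^n)\big] \times E\left[\frac{|\{\ell : (X^n(\ell), Y^n(\ell)) = (x^n, y^n)\}|}{N}\right]. \qquad (16)$$

Therefore, we need to show that by appropriate coding
over the bit-pipes,

$$\left| P((X^n, Y^n) = (x^n, y^n)) - E\left[\frac{|\{\ell : (X^n(\ell), Y^n(\ell)) = (x^n, y^n)\}|}{N}\right] \right| \qquad (17)$$

can be made arbitrarily small. Note that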

$$P((X^n, Y^n) = (x^n, y^n)) = \prod_{t=1}^{n} P\big((X_t, Y_t) = (x_t, y_t) \,\big|\, (X^{t-1}, Y^{t-1}) = (x^{t-1}, y^{t-1})\big), \qquad (18)$$

and

$$\frac{|\{\ell : (X^n(\ell), Y^n(\ell)) = (x^n, y^n)\}|}{N} = \prod_{t=1}^{n} \frac{|\{\ell : (X^t(\ell), Y^t(\ell)) = (x^t, y^t)\}|}{|\{\ell : (X^{t-1}(\ell), Y^{t-1}(\ell)) = (x^{t-1}, y^{t-1})\}|}, \qquad (19)$$

where for $t = 1$,

$$|\{\ell : (X^{t-1}(\ell), Y^{t-1}(\ell)) = (x^{t-1}, y^{t-1})\}| = N.$$

We have already proved that by appropriate coding, we
can make the first term in (19) converge to the first
term in (18) with probability one. By induction, we can
prove that the same result is true for any other term in
(19) and its corresponding term in (18). After proving
this, since all the terms in (19), and as a result their
product, are positive and upper-bounded by 1, we can use
the Dominated Convergence Theorem (see, for example,
[10]) to show that (17) can be made arbitrarily small.
To apply induction, assume there exist some coding
schemes by which we make the first $t - 1$ terms in (19)
each converge to the corresponding term in (18) almost
surely. Using this assumption, we prove that the same
thing is true for the $t$th term as well.
Note that when the first $t - 1$ terms are very
close, the frequency of occurrence of each pattern
$\{(X^{t-1}(\ell), Y^{t-1}(\ell)) = (x^{t-1}, y^{t-1})\}$ across the layers
in $\underline{\mathcal{N}}'$ is very close to the pattern's probability. Since
the two networks perform the same except for link
$[a, b]$, the network guarantees that the frequency of
$\{(X^{t}(\ell), Y^{t-1}(\ell)) = (x^{t}, y^{t-1})\}$ is also close to its
probability in $\underline{\mathcal{N}}$. In order to finish the proof, we use
Lemma 1, proved in Appendix A.

Lemma 1: If we choose the random codes used at times
$t - 1$ and $t$ independently, then

$$E\big[\mathbb{1}_{Y_t(1) = y_t} \,\big|\, (X_{t-1}(1), Y_{t-1}(1)) = (x_{t-1}, y_{t-1}),\, X_t(1) = x_t\big] = P(Y_t(1) = y_t \mid X_t(1) = x_t), \qquad (20)$$

where the expectation is both with respect to the network
and the code selections.

■ 
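
The empirical joint distribution in (10) and the total variation criterion in (14) can be illustrated numerically. The Python sketch below draws the input and output of each layer from a binary symmetric channel with uniform inputs (an example chosen here) and measures how far the resulting empirical joint distribution across the layers is from $p(x)p(y|x)$. It samples the noisy channel directly; it does not implement the bit-pipe simulation code of [5], and all function names are hypothetical.

```python
import random
from collections import Counter


def empirical_joint(xs, ys):
    """Empirical joint distribution of (x, y) pairs across the layers."""
    n = len(xs)
    return {pair: count / n for pair, count in Counter(zip(xs, ys)).items()}


def tv_to_target(p_hat, p):
    """Total variation distance between two pmfs stored as dicts."""
    support = set(p_hat) | set(p)
    return 0.5 * sum(abs(p_hat.get(k, 0.0) - p.get(k, 0.0)) for k in support)


def bsc_joint(crossover):
    """Joint pmf p(x)p(y|x) for a uniform binary input over a BSC."""
    return {(x, y): 0.5 * (crossover if x != y else 1.0 - crossover)
            for x in (0, 1) for y in (0, 1)}


def simulate_bsc_layers(num_layers, crossover, rng):
    """One channel use in each of num_layers independent layers."""
    xs = [rng.randint(0, 1) for _ in range(num_layers)]
    # Flip each input bit independently with probability `crossover`.
    ys = [x ^ int(rng.random() < crossover) for x in xs]
    return xs, ys
```

Calling `simulate_bsc_layers` with a seeded `random.Random` and a growing number of layers gives a total variation distance from `bsc_joint(crossover)` that shrinks as the number of layers increases, which is the behaviour that a bit-pipe code would have to reproduce.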

B. Sources with memory 

Assume that the sources are no longer memoryless but
mixing. That is, for any integers $k$ and $T$,

$$\Big| P\big((U^{(1),k}, \ldots, U^{(m),k}, U^{(1),T+k}_{T}, \ldots, U^{(m),T+k}_{T}) = (u^{(1),k}, \ldots, u^{(m),k}, u^{(1),T+k}_{T}, \ldots, u^{(m),T+k}_{T})\big)$$
$$\quad - P\big((U^{(1),k}, \ldots, U^{(m),k}) = (u^{(1),k}, \ldots, u^{(m),k})\big) \times P\big((U^{(1),T+k}_{T}, \ldots, U^{(m),T+k}_{T}) = (u^{(1),T+k}_{T}, \ldots, u^{(m),T+k}_{T})\big) \Big|$$

goes to 0 as $T$ approaches $\infty$. In the proof of Theorem 2
we used the fact that the sources are correlated and jointly
i.i.d. to conclude that the inputs to the copies of a channel in
the stacked network are i.i.d. If the sources have memory, this
no longer holds. But, if we assume that the sources
are mixing, then for block length $L$ large enough, the two
sets $\{U^{L}, U^{3L}_{2L+1}, \ldots\}$ and $\{U^{2L}_{L+1}, U^{4L}_{3L+1}, \ldots\}$ look like two
i.i.d. sequences. Therefore, in the stacked network, if we code
the even-numbered layers together and the odd-numbered ones
together, such that each group is coded separately from the
other, we get back to the i.i.d. regime and can prove a similar
result.
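
The grouping of alternating blocks described above can be sketched as follows. This is only an illustration of the partition into the two block sets (the function name `split_alternating_blocks` is an assumption); it does not model the mixing property or the coding across layers.

```python
def split_alternating_blocks(sequence, block_length):
    """Split a sequence into length-L blocks and group them alternately.

    Returns (first_group, second_group), where first_group holds the
    1st, 3rd, 5th, ... blocks and second_group holds the 2nd, 4th, ...
    blocks. Trailing symbols that do not fill a block are dropped.
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    usable = len(sequence) - len(sequence) % block_length
    blocks = [sequence[i:i + block_length]
              for i in range(0, usable, block_length)]
    return blocks[0::2], blocks[1::2]
```

Consecutive blocks within each group are separated by a gap of $L$ symbols, which is what lets the mixing assumption make them approximately independent for large $L$.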

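Appendix A: Proof of Lemma 1
Note that

$$\begin{aligned}
& E\big[\mathbb{1}_{Y_t(1) = y_t} \,\big|\, X_{t-1}(1) = x_{t-1},\, Y_{t-1}(1) = y_{t-1},\, X_t(1) = x_t\big] \\
&= \sum_{s_1, l_1, s_2} P\big(Y_t(1) = y_t,\, \underline{X}_{t-1}(2:N) = s_1,\, \underline{Y}_{t-1}(2:N) = l_1,\, \underline{X}_t(2:N) = s_2 \,\big|\, X_{t-1}(1) = x_{t-1},\, Y_{t-1}(1) = y_{t-1},\, X_t(1) = x_t\big) \\
&= \sum_{s_2} P\big(Y_t(1) = y_t \,\big|\, \underline{X}_t = [x_t, s_2]\big)\, P\big(\underline{X}_t(2:N) = s_2 \,\big|\, X_{t-1}(1) = x_{t-1},\, Y_{t-1}(1) = y_{t-1},\, X_t(1) = x_t\big). \qquad \text{(A-1)}
\end{aligned}$$

Moreover, for any output symbol $b$ and any input vector $\underline{x}_t$,

$$\begin{aligned}
P(Y_t(1) = b \mid \underline{X}_t = \underline{x}_t)
&= \frac{1}{P(\underline{X}_t = \underline{x}_t)} \sum_{\underline{y}_t : y_t(1) = b} P(\underline{X}_t = \underline{x}_t,\, \underline{Y}_t = \underline{y}_t) \qquad \text{(A-2)} \\
&= \frac{1}{P(\underline{X}_t = \underline{x}_t)} \sum_{\underline{y}_t : y_t(1) = b} P(X_t(1) = x_t(1))\, P(Y_t(1) = b \mid X_t(1) = x_t(1)) \\
&\qquad \times P(\underline{X}_t(2:N) = \underline{x}_t(2:N) \mid X_t(1) = x_t(1),\, Y_t(1) = b) \\
&\qquad \times P(\underline{Y}_t(2:N) = \underline{y}_t(2:N) \mid \underline{X}_t = \underline{x}_t,\, Y_t(1) = b) \qquad \text{(A-3)} \\
&= \frac{1}{P(\underline{X}_t = \underline{x}_t)} \sum_{\underline{y}_t : y_t(1) = b} P(X_t(1) = x_t(1))\, P(Y_t(1) = b \mid X_t(1) = x_t(1)) \\
&\qquad \times P(\underline{X}_t(2:N) = \underline{x}_t(2:N) \mid X_t(1) = x_t(1)) \\
&\qquad \times P(\underline{Y}_t(2:N) = \underline{y}_t(2:N) \mid \underline{X}_t = \underline{x}_t,\, Y_t(1) = b) \qquad \text{(A-4)} \\
&= P(Y_t(1) = b \mid X_t(1) = x_t(1)) \sum_{\underline{y}_t : y_t(1) = b} P(\underline{Y}_t(2:N) = \underline{y}_t(2:N) \mid \underline{X}_t = \underline{x}_t,\, Y_t(1) = b) \\
&= P(Y_t(1) = b \mid X_t(1) = x_t(1)). \qquad \text{(A-5)}
\end{aligned}$$

Combining (A-1) and (A-5) yields the desired result.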